 term document matrix


information retrieval document search using vector space model in R

#artificialintelligence

Now calculate the cosine similarity between each document and each query. For each query, sort the documents by cosine similarity score and take the three highest-scoring documents.
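The ranking step described here can be sketched roughly as follows. This is a minimal illustration written in Go (the language of the other articles in this listing), not the R code from the article itself. The function names `cosine` and `topK` and the small term-weight vectors are hypothetical, and the sketch assumes each document and query has already been turned into a row of a term-document matrix.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// cosine returns the cosine similarity of two equal-length term vectors.
// It returns 0 when either vector has zero norm, to avoid dividing by zero.
func cosine(a, b []float64) float64 {
	var dot, na, nb float64
	for i := range a {
		dot += a[i] * b[i]
		na += a[i] * a[i]
		nb += b[i] * b[i]
	}
	if na == 0 || nb == 0 {
		return 0
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

// topK scores every document against the query and returns the indices
// of the k highest-scoring documents, ordered from best to worst.
func topK(docs [][]float64, query []float64, k int) []int {
	scores := make([]float64, len(docs))
	idx := make([]int, len(docs))
	for i, d := range docs {
		scores[i] = cosine(d, query)
		idx[i] = i
	}
	// Sort document indices by descending similarity score.
	sort.SliceStable(idx, func(x, y int) bool {
		return scores[idx[x]] > scores[idx[y]]
	})
	if k > len(idx) {
		k = len(idx)
	}
	return idx[:k]
}

func main() {
	// Hypothetical term weights: one row per document, one column per term.
	docs := [][]float64{
		{1, 0, 0},
		{1, 1, 0},
		{0, 0, 1},
		{1, 1, 1},
	}
	query := []float64{1, 1, 0}
	fmt.Println(topK(docs, query, 3))
}
```

With several queries, the same `topK` call would simply be repeated once per query vector.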


R

#artificialintelligence

If only love were so simple – How to graph a heart using R from a fun site called Date By Number. I highly recommend checking it out.


Semantic analysis of webpages with machine learning in Go · James Bowman

#artificialintelligence

I spend a lot of time reading articles on the internet and started wondering whether I could develop software to automatically discover and recommend articles relevant to my interests. There are various aspects to this problem, but I have decided to concentrate first on its core: the analysis and classification of the articles. To illustrate the problem, let's consider the following string representing an article for the purpose of this example. We will attempt to use this article as a query to find similar or related articles from the following set of strings (usually referred to as a 'corpus'), where each string also represents an article. The approaches we will consider for this example work equally well with any type of query, whether the query is itself an article as above or simply a short string of words.


Optimising algorithms in Go for machine learning - Part 2 · James Bowman

#artificialintelligence

This is the second in a series of blog posts sharing my experiences working with algorithms and data structures for machine learning. These experiences were gained whilst building out the nlp project for LSA (Latent Semantic Analysis) of text documents. In Part 1 of this series, I explored alternative approaches for representing and applying TF-IDF transforms for weighting term frequencies across document corpora. We tested the approaches using Go's inbuilt benchmark functionality and found that our optimisations materially improved not just memory consumption but also performance (reducing memory consumption and processing time from 7 GB and 41 seconds to 250 KB and 0.8 seconds respectively). In this blog post I shall explore other areas for optimisation, seeking to further reduce memory consumption and processing time.


A Case Study in Text Mining: Interpreting Twitter Data From World Cup Tweets

Godfrey, Daniel, Johns, Caley, Meyer, Carl, Race, Shaina, Sadek, Carol

arXiv.org Machine Learning

Cluster analysis is a field of data analysis that extracts underlying patterns in data. One application of cluster analysis is in text-mining, the analysis of large collections of text to find similarities between documents. We used a collection of about 30,000 tweets extracted from Twitter just before the World Cup started. A common problem with real world text data is the presence of linguistic noise. In our case it would be extraneous tweets that are unrelated to dominant themes. To combat this problem, we created an algorithm that combined the DBSCAN algorithm and a consensus matrix. This way we are left with the tweets that are related to those dominant themes. We then used cluster analysis to find those topics that the tweets describe. We clustered the tweets using k-means, a commonly used clustering algorithm, and Non-Negative Matrix Factorization (NMF) and compared the results. The two algorithms gave similar results, but NMF proved to be faster and provided more easily interpreted results. We explored our results using two visualization tools, Gephi and Wordle.